Engaging with homologies in discourses: towards a sociological appropriation of topic models

Radim Hladík1, Pierre Benz2, Yann Renisio3

1 Institute of Philosophy, Czech Academy of Sciences
2 School of Library and Information Science, University of Montreal
3 Centre for Research on social InequalitieS, CNRS & Sciences Po

Conceptualizing homologies in scientific position-takings

  • position-takings as a space
    • topic model for science → position-takings in scientific field
    • decomposition of the topic model by disciplines → space
  • homology of the scientific field and disciplinary subsets
    • tentative measures
      • autonomy in topic use
      • degree of refraction of scientific principles inside disciplines

Data

  • Czech R&D Information System
    • 1039577 records (articles, books, book chapters, proceedings)
    • titles, keywords, abstracts, disciplines, authors
    • primarily 2000-2021
  • Topic model
    • Documents: 838428
    • Terms: 116703
    • → 3656 topics

Methods overview

graph TD
    subgraph Data Input
        M1["M1: Document-Topic Matrix (Ndoc x Ntop)
        (~800,000 x ~3,600)"]
    end

    subgraph Discipline-Level Processing
        M2["M2: Discipline-Topic Portfolios (Ndisc x Ntop)
        (42 x ~3,600)"]
        M3["M3: Processed for Topic PCA (Ntop x Ndisc)
        (~3,600 x 42)"]
        M4["M4: Global PC Scores (Ntop x Npc)
        (~3,600 x 42 PCs)"]
    end

    subgraph Individual-Level Processing
        M5["M5: Author-Topic Portfolios (Nauth x Ntop)
        (~100,000 x ~3,600)"]
    end

    subgraph Final Output
        M6["M6: Authors' Global PC Coordinates (Nauth x Npc)
        (~100,000 x 42 PCs)"]
    end

    M1 -- "Mean by discipline" --> M2
    M2 -- "CLR-transform, scale, transpose" --> M3
    M3 -- "PCA" --> M4
    M1 -- "Mean by author" --> M5
    M5 & M4 -- "M5 × M4" --> M6

    style M1 fill:#ccff99,stroke:#333,stroke-width:2px
    style M2 fill:#b0e0e6,stroke:#333,stroke-width:2px
    style M3 fill:#b0e0e6,stroke:#333,stroke-width:2px
    style M4 fill:#b0e0e6,stroke:#333,stroke-width:2px
    style M5 fill:#e0b0e6,stroke:#333,stroke-width:2px
    style M6 fill:#d8bfd8,stroke:#333,stroke-width:2px
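
The M1 to M6 steps in the diagram can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the authors' implementation: the matrix sizes are shrunk, every document is given a single hypothetical discipline and author label, "scale" is read as standardising each discipline to unit variance, and the PCA is computed through a plain SVD. The helper names `group_means` and `clr` are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes standing in for ~800,000 documents, ~3,600 topics, 42 disciplines.
n_doc, n_top, n_disc, n_auth = 200, 12, 4, 30

# M1: document-topic matrix; each row is a topic distribution summing to 1.
M1 = rng.dirichlet(np.ones(n_top), size=n_doc)
doc_disc = np.arange(n_doc) % n_disc  # hypothetical discipline label per document
doc_auth = np.arange(n_doc) % n_auth  # hypothetical single author per document


def group_means(X, labels, n_groups):
    """Average the rows of X within each group."""
    return np.vstack([X[labels == g].mean(axis=0) for g in range(n_groups)])


def clr(X, eps=1e-12):
    """Row-wise centered log-ratio transform; eps guards against log(0)."""
    log_x = np.log(X + eps)
    return log_x - log_x.mean(axis=1, keepdims=True)


# M2: discipline-topic portfolios (n_disc x n_top).
M2 = group_means(M1, doc_disc, n_disc)

# M3: CLR-transform, scale each discipline to unit variance (assumed), transpose.
Z = clr(M2)
M3 = (Z / Z.std(axis=1, keepdims=True)).T  # n_top x n_disc

# M4: PCA of M3 via SVD; rows are topics, columns are PC scores.
U, S, Vt = np.linalg.svd(M3 - M3.mean(axis=0), full_matrices=False)
M4 = U * S  # n_top x n_pc

# M5: author-topic portfolios (n_auth x n_top).
M5 = group_means(M1, doc_auth, n_auth)

# M6: authors' coordinates on the global PCs (n_auth x n_pc).
M6 = M5 @ M4
print(M2.shape, M3.shape, M4.shape, M5.shape, M6.shape)
```

The product `M5 @ M4` works because M5 has one column per topic and M4 has one row per topic, so each author lands in the same space as the global principal components.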

graph TD
    %% Define Nodes with simpler labels
    A[M6: Authors' Global PC Coordinates] --> B{Subsetting by Discipline}

    B --> C1[Subset 1]
    B --> C_n[...]
    B --> C42[Subset 42]

    C1 --> PCA1(PCA on Subset 1)
    C_n --> PCAn(...)
    C42 --> PCA42(PCA on Subset 42)

    PCA1 --> Topics1(Topic frequencies 1)
    PCAn --> Topics_n(...)
    PCA42 --> Topics42(Topic frequencies 42)

    %% Styling (simplified for robustness)
    classDef mainNode fill:#d8bfd8,stroke:#333;
    classDef processNode fill:#e0f2f7,stroke:#333;
    classDef subsetNode fill:#fffacd,stroke:#333;

    class A mainNode;
    class B processNode;
    class C1,C_n,C42 subsetNode;
    class PCA1,PCAn,PCA42 processNode;

Inspired by Class Specific Analysis (CSA), but performed on coordinates of projected individuals.
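
A minimal sketch of the per-discipline step, assuming each author has one main discipline and that each subset is centered on its own mean before the local PCA (the slides do not specify either point). The matrix `M6` here is synthetic stand-in data, and `local_pca` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
n_auth, n_pc, n_disc = 300, 5, 3

# Stand-in for M6: authors' coordinates on the global principal components.
M6 = rng.normal(size=(n_auth, n_pc)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])
auth_disc = np.arange(n_auth) % n_disc  # hypothetical main discipline per author


def local_pca(coords):
    """PCA of one disciplinary subset, expressed in the global coordinates.

    Returns (axes, variances): rows of `axes` are unit vectors in the global
    PC space; `variances` are the variances of the subset along those axes.
    """
    centered = coords - coords.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s**2 / (coords.shape[0] - 1)
    return vt, variances


# One local PCA per discipline, all computed on the projected individuals.
local = {d: local_pca(M6[auth_disc == d]) for d in range(n_disc)}
for d, (axes, variances) in local.items():
    print(d, np.round(variances, 2))
```

Because the local axes are expressed in the global coordinates, they can be drawn directly on the global map and compared with the global components.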

Projection of persons into the space of topics X disciplines

Map of science. Source: (Hladík and Renisio 2025)

Disciplinary subsets and their principal components

Disciplinary subsets and their principal components. Each panel shows the location of individuals affiliated with one discipline in the global map. The primary axes of the local PCA are projected into the original coordinates (PC1 in red, PC2 in blue).

Degree of topical autonomy

Sum of the variances of the local PCA axes, by discipline
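
One way this measure could be computed is sketched below. The function name `topical_autonomy` and the `n_axes` parameter are assumptions made for the example; the slides do not say how many local axes enter the sum, so the default sums all of them, in which case the score equals the total variance of the subset. The two subsets are synthetic.

```python
import numpy as np


def topical_autonomy(coords, n_axes=None):
    """Sum of the variances of the leading local PCA axes of one subset.

    coords: (n_authors, n_pc) coordinates of one discipline's authors.
    n_axes: number of leading axes to sum; None sums all of them (assumption).
    """
    cov = np.cov(coords, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # variances, largest first
    return float(eigvals[:n_axes].sum())


rng = np.random.default_rng(2)
narrow = rng.normal(scale=0.5, size=(200, 4))  # hypothetical concentrated discipline
wide = rng.normal(scale=2.0, size=(200, 4))    # hypothetical dispersed discipline

print(topical_autonomy(narrow), topical_autonomy(wide))
```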

Degree of refraction of the space of science inside disciplines

Cosine similarity between the global and local principal components (primary and secondary). Disciplines are ordered by decreasing cosine similarity with PC1.
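
The comparison can be sketched as follows, assuming the authors' coordinates are already expressed on the global principal components, so that global PC k is the k-th unit vector. Taking the absolute value is an assumption of this sketch, made because the sign of a PCA axis is arbitrary. The helper `refraction` and both subsets are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def refraction(coords):
    """Absolute cosines between local and global PC1, and local and global PC2.

    coords: (n_authors, n_pc) coordinates in the global PC space, so global
    PC k is simply the k-th unit vector.
    """
    centered = coords - coords.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = np.eye(coords.shape[1])
    return abs(cosine(vt[0], basis[0])), abs(cosine(vt[1], basis[1]))


rng = np.random.default_rng(3)
aligned = rng.normal(size=(500, 3)) * np.array([3.0, 2.0, 1.0])
swapped = aligned[:, [1, 0, 2]]  # main spread now lies along global PC2

print(refraction(aligned), refraction(swapped))
```

In this toy case the `aligned` subset reproduces the global hierarchy of axes (cosines close to 1), while the `swapped` subset inverts it (cosines close to 0).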

Biology

Biological sciences: the 10 topics contributing most to PC1 and PC2
negative_pc1 positive_pc1 negative_pc2 positive_pc2
mesenchymal msc mscs multipotent limbal beetle carabidae carabid saproxylic germanica habitus buprestidae hydrophilidae dermestidae anthaxia biomass aboveground miscanthus bioenergy phalaris
tnf anti_inflammatory crp lps calprotectin carpathians highland extinct foothill novohradské monogenean monogenea cichlid teleostei dactylogyrus phenol chlorinate reductive halogenate dehalogenation
isomer oxidize electron_transfer fullerene halogen assemblage contemporaneous millennia olešnice vranovice shedding sativa sowing sorghum radish újezd
kinase mapk akt mtor mitogen fauna faunistic new_record first_record palearctic nile mosquito mosquitoe culex wnv photosystem psii ppb thylakoid pcc
amine biogenic polyamine histamine tyramine spider araneae harvestman scorpion zodarion borrelia sensu lyme burgdorferi burgdorferus laccase decolorization pleurotus anthraquinone ostreatus
catalyze halide nitro ortho imidazole steppe phytosociological relevé swamp ruderal oocyte cumulus gcs granulosa blastocyst chlorophyll_fluorescence irradiance deplet acclimation naked
hydrazone ferritin chelator hepcidin transferrin coleoptera palaearctic genitalia nepal socotra lice chew myrsidea louse phthiraptera reclamation coal_mining heap spoil sokolov
adenine guanine cytosine nucleobase oligonucleotides vegetation landsat evergreen carr carlsbad tapeworm cestoda cestode sucker scolex spruce beech norway_spruce picea_aby spruce_stand
inhibitor inhibitors mgm equipotent irae peat jihlava bog abandoned peatland chrysomelidae curculionidae galerucinae scolytinae medvedev carotenoid chl carotene lutein fractionate
ligand crystallography complexes ligands thioether habitat biodiversity species_richness habitats species_diversity tick ricinus ixodes nymph ixode rice sinapis vicia pistachio vineyards

Biology - LLM’s interpretation

Prompt: You will be provided with a table with four sets of topics representative of authors from four clusters in Biological Sciences.

Your task as an expert annotator with deep knowledge of Biological Sciences is to identify the overarching theme of each cluster based on the representative topics.

Output the following items (in English) that describe the topic of the cluster: ‘short label’ (at most 3 words and format in Title Case), ‘long label’ (at most 8 words and format in Title Case), list of 10 ‘keywords’ (ordered by relevance and format in Title Case), and ‘summary’ (few sentences).
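
Since the same prompt is reused across disciplines, it can be assembled from a template. The sketch below is an illustration only: `PROMPT_TEMPLATE`, `build_prompt`, and the one-line-per-column table layout are assumptions, the output-format paragraph is left out for brevity, and no particular LLM API is implied. The example topics are taken from one row of the Biology table.

```python
PROMPT_TEMPLATE = (
    "You will be provided with a table with four sets of topics representative "
    "of authors from four clusters in {discipline}.\n\n"
    "Your task as an expert annotator with deep knowledge of {discipline} is to "
    "identify the overarching theme of each cluster based on the representative "
    "topics.\n\n"
    "{table}"
)


def build_prompt(discipline, clusters):
    """Fill the template; `clusters` maps a column name to its list of topics."""
    lines = [f"{name}: " + "; ".join(topics) for name, topics in clusters.items()]
    return PROMPT_TEMPLATE.format(discipline=discipline, table="\n".join(lines))


# Hypothetical layout: one entry per table column, each holding topic strings.
clusters = {
    "negative_pc1": ["kinase mapk akt mtor mitogen"],
    "positive_pc1": ["fauna faunistic new_record first_record palearctic"],
    "negative_pc2": ["nile mosquito mosquitoe culex wnv"],
    "positive_pc2": ["photosystem psii ppb thylakoid pcc"],
}
prompt = build_prompt("Biological Sciences", clusters)
print(prompt)
```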

Cluster 1 - Short Label: Molecular Cell Biology

Long Label: Molecular Biology, Cell Signaling, And Biochemistry

Summary: This cluster focuses on molecular and cellular biology, investigating biochemical processes, cell signaling pathways (like MAPK and AKT/mTOR), stem cell biology (MSCs), and inflammatory responses. Research includes enzyme catalysis, ligand interactions, and redox reactions at the cellular level.

Cluster 2 - Short Label: Ecology & Biodiversity

Long Label: Field Ecology, Biodiversity, Faunistics, And Biogeography

Summary: This cluster centers on field ecology, the study of biodiversity, and faunistics, particularly concerning insect (e.g., beetles) and arachnid communities. It involves research on species assemblages in various habitats like steppes, bogs, and forests, often with a biogeographical focus and including vegetation and paleoecological analysis.

Cluster 3 - Short Label: Parasitology & Disease Vectors

Long Label: Parasitology, Medical Entomology, And Vector-Borne Diseases

Summary: This cluster focuses on parasitology, medical and veterinary entomology, and the ecology of vector-borne diseases. Topics include the study of various parasites (e.g., tapeworms, monogeneans), disease vectors such as ticks, mosquitoes, and lice, and associated pathogens like Borrelia (Lyme disease) and West Nile Virus.

Cluster 4 - Short Label: Plant & Environmental Science

Long Label: Plant Science, Photosynthesis, Bioenergy, And Bioremediation

Summary: This cluster covers plant sciences with applications in bioenergy, agriculture, forestry, and environmental remediation. Research includes studies on photosynthesis, plant biomass production (e.g., Miscanthus), crop species, forest ecology, and the use of plants and microbes for the bioremediation of contaminated sites and land reclamation.

Law

Law: the 10 topics contributing most to PC1 and PC2
negative_pc1 positive_pc1 negative_pc2 positive_pc2
religious religion religiosity secularization anthropologist pedestrian traffic_accident crash lane roundabout forensic criminalistic criminology investigative dangerousness small_medium smes sme medium_size medium_sized
sexual habit sexuality arousal harassment utility markowitz efore tvp youth young_people ministries investment invest investing fta investament
canon festival santa margaret saviour supplier outsourcing vendor outsource edi retention dilution ptf bank banking financial_market liquidity banks
jewish jews holocaust shoah semitism infrastructure smart_city bionic ergonomy infrastruktura sexual habit sexuality arousal harassment india indian mauritius burdensome leone
birth baby abortion childbirth birt deposit rim english proficiency receptive determiner unmotivated budget budgeting budgets bep properte
phenomena sonic intrusion alert attacker malware malicious religious religion religiosity secularization anthropologist commerce millennial b2c outreach millennials
rome francis bernard beran statutes budget budgeting budgets bep properte recording recordings webster tax taxation income_tax vat tax_rate
kingdom royal emperor dynasty prince investment invest investing fta investament birth baby abortion childbirth birt value_add profitability eva envelopment financial_performance
war diplomacy moscow defeat uprising latent sit evaluations samplex differentiation puma ovo diferenciace fields
sport club sports olympic sporting advertising advertisement advertise influencer neuromarketing medical ebm funder ambulances k4care macro cas dimen

Law - LLM’s interpretation

Prompt: You will be provided with a table with four sets of topics representative of authors from four clusters in Law.

Your task as an expert annotator with deep knowledge of Law is to identify the overarching theme of each cluster based on the representative topics.

Output the following items (in English) that describe the topic of the cluster: ‘short label’ (at most 3 words and format in Title Case), ‘long label’ (at most 8 words and format in Title Case), list of 10 ‘keywords’ (ordered by relevance and format in Title Case), and ‘summary’ (few sentences).

Cluster 1 - Short Label: Law, Religion & History

Long Label: Law, Religion, History, And Socio-Cultural Issues

Summary: This cluster explores the interplay between law and diverse societal elements including religion, historical events, and cultural practices. It covers topics like religious legal systems, historical governance, family law matters, international relations concerning conflict, and the legal aspects of sports and significant historical atrocities.

Cluster 2 - Short Label: Commercial & Tech Law

Long Label: Modern Commercial, Technology, And Financial Regulatory Law

Summary: This cluster focuses on contemporary legal areas dealing with commerce, technology, finance, and public safety. It includes business transactions, IT and cybersecurity regulations, financial and investment law, urban infrastructure development, advertising, traffic accident liability, and public budgeting.

Cluster 3 - Short Label: Criminal & Social Law

Long Label: Criminal Law, Family Law, And Socio-Medical Legal Issues

Summary: This cluster centers on legal fields addressing crime, social welfare, and medical contexts. Key areas include criminal justice and forensics, family law (including birth and sexual harassment), youth justice, the intersection of law and religion, and legal aspects of healthcare and medical practice.

Cluster 4 - Short Label: Business & Financial Law

Long Label: Corporate, Financial, Tax, And International Business Law

Summary: This cluster is concentrated on legal frameworks governing business operations, finance, and taxation, often with an international dimension. It covers areas such as small and medium-sized enterprises, banking and financial markets, investment strategies, international commerce (with focus on specific jurisdictions like India/Mauritius), and tax regulations.

Discussion

  • more questions than answers
  • imagine the field of sports
    • individual x team
    • body x tool
  • tennis
    • singles x doubles
      • but is this what is at stake in tennis?
      • maybe grass x clay surface, left-handed x right-handed
  • levels of abstraction (content instead of survey) and partiality (two measures and LLM support in interpretation)

Acknowledgements

This work was financially supported by the project OP JAK: Knowledge in the Age of Distrust, CZ.02.01.01/00/23_025/0008711.